Feature:3963 Step HeartBeat components #4073

Json-Andriopoulos · 2025-10-20T13:54:56Z

Backend heartbeat support (DB, API)
Heartbeat monitoring worker

Describe changes

I implemented/fixed _ to achieve _.

Pre-requisites

Please ensure you have done the following:

I have read the CONTRIBUTING.md document.
I have added tests to cover my changes.
I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
IMPORTANT: I made sure that my changes are reflected properly in the following resources:
- ZenML Docs
- Dashboard: Needs to be communicated to the frontend team.
- Templates: Might need adjustments (that are not reflected in the template tests) in case of non-breaking changes and deprecations.
- Projects: Depending on the version dependencies, different projects might get affected.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Other (add details above)

Json-Andriopoulos · 2025-10-20T14:01:08Z

src/zenml/steps/heartbeat.py

+                        _thread.interrupt_main()  # raises KeyboardInterrupt in main thread
+                    # Ensure we stop our own loop as well.
+                    self._running = False
+                except Exception:


TODO: Improve this. At a minimum, catch HTTP errors separately and log them at a more verbose level, so that a persistent failure (for instance, the server returning a 500 status code) does not flood the logs.
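One way to approach this TODO, as a rough sketch only: warn on the first failure of a streak and drop repeats of the same failure to DEBUG, with a periodic reminder. The class name, the `every` parameter, and the idea of keying on a string such as "http-500" are assumptions made for illustration and are not part of the PR.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


class ThrottledErrorLogger:
    """Logs the first failure of a streak loudly and the repeats quietly."""

    def __init__(self, every: int = 10) -> None:
        self._every = every  # re-emit a warning on every N-th repeat
        self._last_key: Optional[str] = None
        self._repeats = 0

    def record(self, key: str, message: str) -> int:
        """Log `message` and return the logging level that was used."""
        if key != self._last_key:
            # A new kind of failure: restart the streak and warn once.
            self._last_key = key
            self._repeats = 0
            level = logging.WARNING
        else:
            self._repeats += 1
            # Same failure again: stay at DEBUG except on every N-th repeat.
            level = (
                logging.WARNING
                if self._repeats % self._every == 0
                else logging.DEBUG
            )
        logger.log(level, "%s (repeat #%d)", message, self._repeats)
        return level

    def reset(self) -> None:
        """Call after a successful heartbeat to end the current streak."""
        self._last_key = None
        self._repeats = 0
```

The worker would call `record()` in its `except` branch and `reset()` after a successful request, so a server that keeps answering with 500 produces one warning per `every` attempts instead of one per heartbeat.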

Json-Andriopoulos · 2025-10-27T08:54:18Z

Questions/Comments for reviewers @schustmi @bcdurak :

Log info records

I see that, in general, core components (StepLauncher, StepRunner, etc.) emit very few log records. For better visibility during development I added some log records to the heartbeat worker; should these be removed? I assume we keep systemic logs to a minimum to avoid polluting the user experience, since users are mainly interested in their step function logs. Two follow-up recommendations: a) use structured logs with context variables (https://www.structlog.org/en/stable/) to easily filter records by metadata values; b) introduce a configurable systemic logger that is suppressed by default and, when activated, shows all systemic logs.
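To make recommendation b) concrete, here is a minimal sketch using only the standard library `logging` module. The logger name `zenml.system` and the `ZENML_SYSTEM_LOGS` environment variable are hypothetical names invented for this example; the adapter is a lightweight stand-in for the context variables that structlog would provide.

```python
import logging
import os

# Hypothetical names, chosen only for this sketch.
SYSTEM_LOGGER_NAME = "zenml.system"
SYSTEM_LOGS_ENV = "ZENML_SYSTEM_LOGS"


def get_system_logger() -> logging.Logger:
    """Return a logger for internal components, quiet unless enabled."""
    system_logger = logging.getLogger(SYSTEM_LOGGER_NAME)
    enabled = os.environ.get(SYSTEM_LOGS_ENV, "").lower() in {"1", "true"}
    # When disabled, only WARNING and above get through.
    system_logger.setLevel(logging.DEBUG if enabled else logging.WARNING)
    return system_logger


class StepContextAdapter(logging.LoggerAdapter):
    """Prefixes each message with step metadata so records can be filtered."""

    def process(self, msg, kwargs):
        return f"[step_id={self.extra['step_id']}] {msg}", kwargs
```

With this shape, the heartbeat worker could log through `StepContextAdapter(get_system_logger(), {"step_id": ...})` and users would see nothing below WARNING unless they opt in.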

Handling of constants

Currently the heartbeat interval is hard-coded as a class variable on StepHeartbeatWorker. I definitely don't want to expose it through user-provided settings, as it should be a system setting (too-frequent heartbeats from multiple steps may end up overloading the REST server). I believe a good value would be somewhere in the range of 30-60 seconds. Where would you put this value? Under config/constants.py? In a config object?

Interrupt implementation

I went over our signals/daemonize implementations. While that would be the proper approach on any Unix-based system, it is not compatible with Windows. I opted to use _thread.interrupt_main() instead, which raises a KeyboardInterrupt exception by default, and to capture it with a context manager that re-raises it as a custom exception. Let me know your thoughts.
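A minimal sketch of the pattern described above, assuming a polling worker thread. The class name, the `fetch_status` callable (standing in for the REST call), and the "stopped" status value are hypothetical and not taken from the PR. The `interrupt` callable defaults to `_thread.interrupt_main`, which simulates a SIGINT arriving in the main thread and therefore raises KeyboardInterrupt there under the default handler; making it injectable keeps the worker testable.

```python
import _thread
import threading
from typing import Callable


class HeartbeatWorkerSketch(threading.Thread):
    """Polls a status source and interrupts the main thread on a stop status."""

    def __init__(
        self,
        fetch_status: Callable[[], str],
        interval: float = 30.0,
        interrupt: Callable[[], None] = _thread.interrupt_main,
    ) -> None:
        super().__init__(daemon=True)
        self._fetch_status = fetch_status  # stands in for the heartbeat request
        self._interval = interval
        self._interrupt = interrupt
        self._stop_event = threading.Event()
        self.is_terminated = False

    def run(self) -> None:
        while not self._stop_event.is_set():
            if self._fetch_status() == "stopped":
                # Record that this worker caused the interrupt, then fire it.
                self.is_terminated = True
                self._interrupt()
                break
            # Sleep between heartbeats, but wake up at once if stop() is called.
            self._stop_event.wait(self._interval)

    def stop(self) -> None:
        self._stop_event.set()
```

Because `interrupt_main` only sets a pending interrupt, the main thread sees the KeyboardInterrupt the next time it executes Python bytecode, not in the middle of a blocking C call.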

src/zenml/constants.py

schustmi · 2025-10-29T04:54:00Z

src/zenml/utils/exception_utils.py

+        self._target_exception = target_exception
+        self._message = message


To simplify this, I guess we could just pass an instance of the exception here instead of the class and message? That would also allow exceptions that can or need to be instantiated with multiple arguments.
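To illustrate the suggestion, here is a sketch of a context manager that takes a ready-made exception instance. `ReraiseAs` and `StepStoppedError` are hypothetical names for this example and do not reflect the actual code in exception_utils.py.

```python
from types import TracebackType
from typing import Optional, Type


class ReraiseAs:
    """Re-raises a source exception type as a pre-built exception instance."""

    def __init__(
        self, source: Type[BaseException], target: BaseException
    ) -> None:
        self._source = source
        self._target = target

    def __enter__(self) -> "ReraiseAs":
        return self

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc: Optional[BaseException],
        tb: Optional[TracebackType],
    ) -> bool:
        if exc is not None and isinstance(exc, self._source):
            # Chain the original exception so the traceback stays informative.
            raise self._target from exc
        return False  # every other exception propagates unchanged


class StepStoppedError(Exception):
    """Hypothetical exception that needs more than one argument."""

    def __init__(self, step_name: str, reason: str) -> None:
        super().__init__(f"Step {step_name} stopped: {reason}")
        self.step_name = step_name
        self.reason = reason
```

The caller builds the exception however it likes, for example `ReraiseAs(KeyboardInterrupt, StepStoppedError("trainer", "heartbeat"))`, and the context manager no longer needs to know the constructor signature.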

schustmi · 2025-10-29T04:55:26Z

src/zenml/models/v2/core/step_run.py

+    """Light-weight model for Step Heartbeat responses."""
+
+    id: UUID
+    status: str


This should probably be of type ExecutionStatus?

schustmi · 2025-10-29T05:00:50Z

src/zenml/steps/heartbeat.py

+                            "interrupting main thread",
+                            self.name,
+                        )
+                        _thread.interrupt_main()  # raises KeyboardInterrupt in main thread


My dynamic pipelines PR introduces running multiple steps in different threads, which doesn't work with this I think.

Can we somehow store the thread from which the heartbeat worker was started, and then interrupt that thread instead of the main one?

Yeah, that is an important change, good point. interrupt_main will not work here; we will need to change the pattern a bit. Should I base my changes on your branch?
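For reference, one known CPython-specific technique for interrupting a particular thread is `PyThreadState_SetAsyncExc`, reachable through `ctypes`. This is only a sketch of that option, not what the PR does; `HeartbeatInterruptSketch` and `raise_in_thread` are hypothetical names, and the worker would have to store `threading.current_thread()` when it is created in order to have a target.

```python
import ctypes
import threading


class HeartbeatInterruptSketch(Exception):
    """Hypothetical exception to be raised inside the step thread."""


def raise_in_thread(thread: threading.Thread, exc_type: type) -> bool:
    """Ask CPython to raise `exc_type` in `thread`; True if it was scheduled."""
    if thread.ident is None:
        return False  # the thread has not been started yet
    # Returns the number of thread states that were modified (1 on success).
    affected = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread.ident), ctypes.py_object(exc_type)
    )
    return affected == 1
```

The exception is delivered only when the target thread next executes Python bytecode, so a step blocked inside a long C call would not be interrupted until that call returns. This also ties the implementation to CPython.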

schustmi · 2025-10-29T05:03:43Z

src/zenml/zen_server/routers/steps_endpoints.py

+    step = zen_store().get_run_step(step_run_id, hydrate=True)
+    pipeline_run = zen_store().get_run(step.pipeline_run_id)
+    verify_permission_for_model(pipeline_run, action=Action.UPDATE)


I'm wondering whether this RBAC check is even necessary, as running all of this will take quite some time (two calls to the DB, then a request to the RBAC service).

Is there any real harm in leaving this unprotected? I guess it would allow users potential access to the status of the step, which I'm not sure really is a concern.

True, we can probably do both authentication & authorization with pipeline tokens. Will discuss with @stefannica for directions.

Let me suggest an alternative: we could limit this endpoint to only be accessed by running pipelines.

Running pipelines (the containerized environment where the steps are running actually) use something called "a workload API token" which is only valid as long as the pipeline run itself is not yet finalized. These workload API tokens are tied to a particular pipeline run (or schedule, in case of scheduled pipelines). So we can also use their scope to limit the range of targets that they can update.

Some references:

this is the code that verifies the pipeline scoped tokens (you can see some leeway is involved): https://github.com/zenml-io/zenml/blob/main/src/zenml/zen_server/auth.py#L406-L475

same thing for the schedule-scoped tokens: https://github.com/zenml-io/zenml/blob/main/src/zenml/zen_server/auth.py#L363-L404

A sketch of how you can use this in your endpoint:

    def update_heartbeat(
        step_run_id: UUID,
        auth_context: AuthContext = Security(authorize),
    ) -> StepHeartbeatResponse:
        ...
        if not auth_context.access_token or (
            not auth_context.access_token.schedule_id
            and not auth_context.access_token.pipeline_run_id
        ):
            raise AuthorizationException("Not authorized")

        if auth_context.access_token.pipeline_run_id:
            # optionally, check that the step ID is part of this run ID
            ...
        else:  # if auth_context.access_token.schedule_id
            # optionally, check that the step ID is part of a run ID that
            # was scheduled with this schedule
            ...

This will no longer rely on RBAC calls, but it might still flood the database with a lot of requests, so maybe you could also implement a mini-caching system like the ones used in the previous code references, to reduce its impact.
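A rough sketch of what such a mini-cache could look like, assuming the expensive part is resolving a step run ID to its pipeline run ID. `TTLCache` and the `loader` callable (standing in for the database query) are hypothetical and are not the caching code used in auth.py.

```python
import time
from typing import Callable, Dict, Tuple
from uuid import UUID


class TTLCache:
    """Caches step-run-ID to pipeline-run-ID lookups for a short time."""

    def __init__(self, loader: Callable[[UUID], UUID], ttl: float = 60.0) -> None:
        self._loader = loader  # stands in for the database query
        self._ttl = ttl
        self._entries: Dict[UUID, Tuple[float, UUID]] = {}

    def get(self, step_run_id: UUID) -> UUID:
        now = time.monotonic()
        entry = self._entries.get(step_run_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit: no database call
        # Miss or stale entry: reload and remember when it was loaded.
        value = self._loader(step_run_id)
        self._entries[step_run_id] = (now, value)
        return value
```

Since the mapping from a step run to its pipeline run does not change, a stale entry is harmless here. The sketch never evicts entries and does no locking, so a server-side version would likely need a size bound and some thought about concurrent requests.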

@stefannica That was my initial idea as well, but do we use those tokens also when running pipelines with service accounts? I thought at some point we used the API key directly when running scheduled pipelines, but I might be misremembering.

I know for sure though that there is a way to generate a generic unscoped token instead of a workload token when running a pipeline (by setting some token expiration env variable), so we'll have to think about how we handle this case.

@schustmi yes, even when running pipelines with service accounts, we generate workload API tokens scoped to the pipeline run or schedule. The only case where we use a generic unscoped token is if you set the ZENML_PIPELINE_API_TOKEN_EXPIRATION env variable. But this is a very obscure case, which I don't think we need to handle separately. In that case, we can just not run any RBAC-like checks on this endpoint.

The problem is that this is a client-side env variable and I'm not sure we can recognize this case in the server endpoint that receives the heartbeat requests. It will just be a generic token that is not scoped to the run, and if we allow those to call the endpoint without any checks then everyone can do so, no?

schustmi · 2025-11-05T09:05:23Z

src/zenml/zen_server/routers/steps_endpoints.py

+                ctx.access_token.schedule_id, hydrate=True
+            )
+
+            if pipeline_run.pipeline.id != schedule.pipeline_id:


Maybe instead of this we can check pipeline_run.schedule.id == ctx.access_token.schedule_id, to make sure the run is actually triggered by the schedule that the token is scoped to?

Nice, I missed that. Yes, that's better: one fewer DB call.

schustmi · 2025-11-05T09:07:28Z

src/zenml/orchestrators/step_launcher.py

+        # Since interrupt_main raises KeyboardInterrupt we want in this context to capture it
+        # and handle it as a custom exception.
+
+        with ContextReraise(


Just a question for clarification: If someone actually interrupts the python process with CTRL-C, that raises a KeyboardInterrupt in the main thread I assume?

It would be captured the same way and a misleading exception would appear I would think. But is that a scenario we need to worry about?

I guess I personally do it quite a lot which is why I thought about it, not sure how common/important it is though. I guess it's quite easy to solve though, so maybe we can account for it?

It could just be a boolean flag on the heartbeat worker that signals that it interrupted the main thread, and we can use that flag to decide whether to re-raise with the HeartbeatInterrupt exception or keep the original KeyboardInterrupt?

Yep, that flag already exists on the heartbeat worker: it is .is_terminated. Easy to do, will cover that as well.
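The flag-based decision discussed in this thread could look roughly like the sketch below. `HeartbeatInterruptError` is a hypothetical stand-in for the PR's custom exception, and `was_terminated` is assumed to read the worker's `.is_terminated` flag.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class HeartbeatInterruptError(Exception):
    """Hypothetical stand-in for the PR's custom heartbeat exception."""


@contextmanager
def reraise_if_heartbeat(was_terminated: Callable[[], bool]) -> Iterator[None]:
    """Convert KeyboardInterrupt only if the heartbeat worker caused it."""
    try:
        yield
    except KeyboardInterrupt as exc:
        if was_terminated():
            raise HeartbeatInterruptError("Step stopped by heartbeat") from exc
        raise  # a genuine CTRL-C: keep the original KeyboardInterrupt
```

Usage would be along the lines of `with reraise_if_heartbeat(lambda: worker.is_terminated): ...`, so a manual CTRL-C still surfaces as a plain KeyboardInterrupt.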

schustmi · 2025-11-05T09:10:18Z

src/zenml/orchestrators/step_launcher.py

+                f"Initiating heartbeat for step: {self._invocation_id}"
+            )
+
+            StepHeartbeatWorker(step_id=step_run.id).start()


I'm guessing you don't stop the worker here and instead rely on the server response for the thread to shut down automatically?

Wouldn't it be better to add an explicit stop once the user code finished executing?

Yep, makes sense to add it. It won't make much of a difference now, but it is safer and more future-proof.
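A sketch of the explicit stop, assuming the worker exposes a `stop()` method backed by a `threading.Event`. `StoppableWorker` and `run_with_heartbeat` are hypothetical names; the real worker and step launcher will differ.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class StoppableWorker(threading.Thread):
    """Minimal stand-in for the heartbeat worker: ticks until stopped."""

    def __init__(self, interval: float = 30.0) -> None:
        super().__init__(daemon=True)
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self) -> None:
        # Event.wait returns True as soon as stop() sets the event.
        while not self._stop_event.wait(self._interval):
            pass  # a real worker would send the heartbeat here

    def stop(self) -> None:
        self._stop_event.set()


def run_with_heartbeat(user_code: Callable[[], T]) -> T:
    """Run `user_code` with a heartbeat thread that never outlives it."""
    worker = StoppableWorker(interval=0.01)
    worker.start()
    try:
        return user_code()
    finally:
        # Runs on success and on failure alike.
        worker.stop()
        worker.join(timeout=5)
```

Putting the stop in a `finally` block means the thread is shut down even when the user code raises, instead of lingering until the next server response.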

github-actions bot added the enhancement New feature or request label Oct 20, 2025

Json-Andriopoulos requested a review from schustmi October 20, 2025 14:03

Json-Andriopoulos force-pushed the feature/3963-step-run-heartbeat branch from 123c1c4 to 18d8b76 Compare October 27, 2025 07:29

Json-Andriopoulos requested a review from bcdurak October 27, 2025 07:56

bcdurak linked an issue Oct 27, 2025 that may be closed by this pull request

Implement a step heartbeat function #3963

Open

1 task

Json-Andriopoulos force-pushed the feature/3963-step-run-heartbeat branch from 18d8b76 to 1352677 Compare October 27, 2025 08:55

schustmi requested changes Oct 29, 2025

Feature:3963 Step HeartBeat components

6a15b66

- Backend heartbeat support (DB, API) - Heartbeat monitoring worker

Json-Andriopoulos force-pushed the feature/3963-step-run-heartbeat branch from 3db4503 to 6d077e6 Compare November 5, 2025 08:39

Json-Andriopoulos added the run-slow-ci label Nov 5, 2025

Json-Andriopoulos marked this pull request as ready for review November 5, 2025 08:39

Json-Andriopoulos requested a review from schustmi November 5, 2025 08:39

Json-Andriopoulos force-pushed the feature/3963-step-run-heartbeat branch 3 times, most recently from 54b63a9 to d416056 Compare November 5, 2025 08:53

schustmi requested changes Nov 5, 2025

Json-Andriopoulos force-pushed the feature/3963-step-run-heartbeat branch from d416056 to 1d19a9f Compare November 5, 2025 10:50

Json-Andriopoulos requested a review from schustmi November 5, 2025 10:55

Json-Andriopoulos force-pushed the feature/3963-step-run-heartbeat branch from 1d19a9f to 1649b45 Compare November 5, 2025 11:17

fixup! Improvements and bug fixes

c40b9bf

- Updates migration down revision refs - context-reraise exception - changes in the step-heartbeat logic - fix null heartbeat in list/get endpoints

Json-Andriopoulos force-pushed the feature/3963-step-run-heartbeat branch from 1649b45 to c40b9bf Compare November 5, 2025 15:34

4 participants
